This lecture was inspired by, and modified in part from, “Text Mining Fedspeak” by Len Kiefer

The Federal Reserve is a natural target of text mining for economists. The Federal Open Market Committee (FOMC) monetary policy statement is parsed and prodded each time the FOMC announces a change (or no change at all). For example, the Wall Street Journal provides a Fed Statement Tracker which lets you compare changes from one FOMC statement to the next. Narasimhan Jegadeesh and Di Wu have a paper on SSRN, “Deciphering Fedspeak: The Information Content of FOMC Meetings”, that uses text mining techniques on FOMC meeting minutes.

Researchers have also looked at the transcripts of FOMC meetings. San Cannon has a paper, “Sentiment of the FOMC Unscripted” (PDF), that applies text mining tools to FOMC transcripts.

We’ll look at the Federal Reserve’s semi-annual Monetary Policy Report. This report is typically issued in February and July; the latest is the July 2022 report. We can download PDF files for each July report from 1996 through 2022, though the URL format has changed slightly over the years.
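The code that follows assumes the reports have already been downloaded to a local `./pdfs/` folder and that `fed_links` holds their file paths in chronological order. A minimal sketch of how `fed_links` might be built (the folder name and date-based file naming are assumptions, since the download step isn't shown here):

```r
# Sketch: build `fed_links` from a local folder of downloaded reports.
# Assumes files are named by report date, e.g. "./pdfs/2022_06_17.pdf",
# so that stripping "./pdfs/", "_", and ".pdf" leaves a parseable date.
fed_links <- sort(list.files("./pdfs", pattern = "\\.pdf$", full.names = TRUE))
```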

1 Single Report

Let’s first load in a single report, for July 2022, available at: https://www.federalreserve.gov/monetarypolicy/files/20220617_mprfullreport.pdf

We’ll use pdftools to import the pdf file:

# Importing a single PDF
library(pdftools)  # provides pdf_text()

v = fed_links[length(fed_links)]  # last element: the most recent (July 2022) report

fed_import = pdf_text(v)

str(fed_import)
 chr [1:77] "                                                   For use at 11:00 a.m. EDT\n                                 "| __truncated__ ...

The pdf_text function returns a character vector of strings, one for each page (there are a total of 77 pages in this report).

Let’s take a look at page 7’s first 500 characters:

substr(fed_import[7], 1, 500)
[1] "                                                                                                    1\n\n\n\n\nSummary\nIn the first part of the year, inflation remained   Recent Economic and Financial\nwell above the Federal Open Market                  Developments\nCommittee’s (FOMC) longer-run objective\nof 2 percent, with some inflation measures          Inflation. Consumer price inflation, as\nrising to their highest levels in more than         measured by the 12-month change in the\n40 years. These "

As we can see, there’s a good amount of blank space, and the special character \n indicates line breaks.

We can deal with this by splitting on \n with strsplit().

# Turn the file path into a report label like "Jun2022".
# Note: the named-vector form of str_replace_all takes no separate
# replacement argument, so the stray "" has been dropped.
v = v %>% 
  str_replace_all(c("./pdfs/" = "", "_" = "", ".pdf" = "")) %>% 
  as_date() %>% 
  format("%b%Y")

# Get the pages and then the line for each page
fed_text_raw = data.frame(
  text = fed_import, 
  stringsAsFactors = FALSE
) %>%
  mutate(
    page = row_number(),
    text = strsplit(text, "\n"),
    report = v
  ) %>% # Separate by line
  unnest(text) %>%
  group_by(report) %>%
  mutate(line = row_number()) %>%
  ungroup() 

Now we can apply the tidytext mining techniques.

fed_text=fed_text_raw %>%
  as_tibble() %>%
  tidytext::unnest_tokens(word, text)

# Nice table format
# datatable(fed_text, options = list(autoWidth = TRUE))

Let’s count up the words:

fed_text %>%
  count(word, sort = TRUE) %>%
  datatable(options = list(autoWidth = TRUE))

There are a lot of common words like “the”, “of”, and “in”. In text mining, these words are called “stop words”. We can remove them using anti_join and the stop_words list that comes with the tidytext package.

fed_text %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  datatable(options = list(autoWidth = TRUE))

The Fed really likes talking about rates, but we also find some numbers in the text. The year 2022 appears a lot.

Let’s drop numbers from the text. In older reports, the Fed liked to use fractions written with special characters, so we’ll take a heavy-handed approach and keep only alphabetic characters.

fed_text_2=fed_text %>%
  mutate(word = gsub("[^A-Za-z ]", "", word)) %>%
  filter(word != "")

fed_text_2 %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  datatable(options = list(autoWidth = TRUE))

1.1 Sentiment

What’s the overall sentiment of the report? Text mining lets us score text, or portions of text, for sentiment. We can apply one of the sentiment lexicons supplied by tidytext to score the report. For this example, we will use the bing lexicon from Bing Liu and collaborators.
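For reference, the bing lexicon is just a two-column table mapping each word to a binary sentiment label, which is why a simple inner_join is enough to score the text. A quick peek (a sketch, not part of the original analysis):

```r
library(tidytext)
library(dplyr)

# The bing lexicon ships with tidytext: one row per word,
# with sentiment coded as "positive" or "negative"
get_sentiments("bing") %>%
  count(sentiment)
```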

Let’s see what the most frequently used negative and positive words are based on the bing lexicon.

tbl=fed_text_2 %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

tbl %>%
  datatable(options = list(autoWidth = TRUE))

In this report, we can see that risks, a negative word, is used 43 times, while appropriate, a positive word, is used 33 times. Debt, the sixth most frequent word, is scored as negative; however, in an economic report it may be merely descriptive rather than negative.

1.2 Bigrams

We can apply tidytext principles not only to single words but also to consecutive sequences of words, called n-grams.

fed_bigrams=fed_text_raw %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  as_tibble()

# fed_bigrams %>% datatable(options = list(autoWidth = TRUE))

Count the bigrams:

fed_bigrams %>%
  count(bigram, sort = TRUE) %>%
  datatable(options = list(autoWidth = TRUE))

As Silge and Robinson point out, many of the bigrams are uninteresting. Let’s filter out uninteresting bigrams that contain stop words.

bigrams_separated=fed_bigrams %>%
  separate(
    bigram,
    c("word1", "word2"), 
    sep = " "
  )

bigrams_filtered=bigrams_separated %>%
  filter(
    word1 %nin% stop_words$word,  # %nin% ("not in") comes from the Hmisc package
    word2 %nin% stop_words$word
  )

bigram_counts=bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

bigram_counts %>%
  datatable(options = list(autoWidth = TRUE))
# Unite them
bigrams_united = bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ") 

bigrams_united %>%
  datatable(options = list(autoWidth = TRUE))

Now, let’s find out whether the Fed used the word “gross” mostly in the context of gross domestic product.

bigrams_filtered %>%
  filter(word1 == "gross") %>%
  count(word2, sort = TRUE) %>%
  datatable(options = list(autoWidth = TRUE))

1.3 Revised Sentiment

After analyzing the report’s word frequencies, I came up with a list of words that probably aren’t negative or positive in the usual sense. I added them to the original stop_words to create custom_stop_words2.

custom_stop_words2=bind_rows(
  tibble(
    word = c("debt", "gross", "crude", "well", "maturity", "work", "marginally", "leverage"), 
    lexicon = c("custom")
  ),
  stop_words
)

fed_sentiment=fed_text %>%
  anti_join(custom_stop_words2) %>%
  inner_join(get_sentiments("bing")) %>%
  count(report, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
 
# fed_sentiment %>% datatable(options = list(autoWidth = TRUE))
ggplot(fed_sentiment, aes(index, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("red","#27408b")) +
  facet_wrap(~report, ncol = 5, scales = "free_x") +
  theme_ridges(font_family = "Arial") +
  labs(
    x = "index (approximately 3 pages per unit)",
    y = "sentiment",
    title = "Sentiment through Federal Reserve Monetary Policy Report",
    subtitle = "customized bing lexicon",
    caption= "Source: https://www.federalreserve.gov/monetarypolicy/files/20220617_mprfullreport.pdf"
  )

This trend tells an interesting story. The text began positive, dropped off, but then surged in the middle. Around the last third of the text, near Part 3: Summary of Economic Projections, sentiment turns negative as the text describes forecasts and risks.

2 Multiple Reports

Let’s expand our analysis by capturing the text of each July Monetary Policy Report from 1996 through 2022. We’ll compare the relative frequency of words and topics and see how sentiment (as we captured it above) varies across reports.

df_fed=fed_links %>%
  tibble(
    link=., 
    # the named-vector form of str_replace_all takes no separate replacement argument
    report=str_replace_all(., c("./pdfs/"="", "_"="", ".pdf"="")) %>% as_date() %>% format("%b%Y")
  ) %>%
  mutate(text=future_map(link, pdf_text)) %>%  # future_map() comes from the furrr package
  select(-link) %>%
  unnest(text) %>% 
  group_by(report) %>% 
  mutate(page = row_number()) %>%
  ungroup() %>% 
  mutate(text = strsplit(text, "\n")) %>% 
  unnest(text) %>% 
  group_by(report) %>% 
  mutate(line = row_number()) %>% 
  ungroup() %>% 
  select(report, line, page, text)
# df_fed %>% datatable(options = list(autoWidth = TRUE))

2.1 Compare word counts

Let’s start with just the number of words per report.

fed_words=df_fed %>%
  unnest_tokens(word, text) %>%
  count(report, word, sort = TRUE) %>%
  ungroup()

total_words = fed_words %>%
  group_by(report) %>%
  summarize(total = sum(n))

total_words %>%
  datatable(options = list(autoWidth = TRUE))
ggplot(data = total_words, aes(x=as.numeric(str_match(report, "[0-9]+")), y = total))+
  geom_line(color = "#27408b")+
  geom_point(shape = 21, fill = "white", color = "#27408b", size = 3, stroke = 1.1)+
  scale_y_continuous(labels = scales::comma)+
  theme_ridges(font_family = "Arial")+
  labs(
    x = "year", 
    y = "number of words",
    title = "Number of words in Federal Reserve Monetary Policy Report",
    subtitle = "July of each year 1996-2022",
    caption = "Source: Federal Reserve Board Monetary Policy Reports"
  )

The Jul2012 report is one of the longer reports, with over 33,145 words. We can also see a pretty clear break at the end of the Greenspan tenure in 2005, after which the reports got substantially longer.

2.2 What are they talking about?

Let’s compile a list of the most frequently used words in each report. As before, we’ll omit stop words.

fed_text = df_fed %>%
  select(report, page, line, text) %>%
  unnest_tokens(word, text)

fed_topic = fed_text %>%
  mutate(word = gsub("[^A-Za-z ]", "", word)) %>%  # keep only letters (drop numbers and special symbols)
  filter(word != "") %>%
  anti_join(stop_words) %>%
  group_by(report) %>%
  count(word, sort = TRUE) %>% 
  mutate(rank = row_number()) %>%
  ungroup() %>% 
  arrange(rank, report) %>%
  filter(rank < 11) 

# fed_topic %>% datatable(options = list(autoWidth = TRUE))
# Most Frequent Words
ggplot(fed_topic, aes(y = n, x = fct_reorder(word, n))) +
  geom_col(fill = "#27408b") +
  facet_wrap(~report, scales = "free", ncol = 5) +
  coord_flip() + 
  theme_ridges(font_family = "Arial", font_size = 10) +
  labs(
    x = "",
    y = "",
    title = "Most Frequent Words Federal Reserve Monetary Policy Report",
    subtitle = "Excluding stop words and numbers.",
    caption = "Source: Federal Reserve Board Monetary Policy Reports"
  )

Lots of talking about rates. Let’s see if we can get some more information out of this data.

Following Silge and Robinson, we can use the bind_tf_idf function to bind the term frequency and inverse document frequency to our tidy text dataset. This statistic decreases the weight on very common words and increases the weight on words that appear in only a few documents. In essence, it extracts the most distinctive terms from each report.
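To make the statistic concrete, here is a toy illustration of bind_tf_idf on a hand-made word-count table (the two "reports" and their counts are invented for the example, not taken from the Fed data):

```r
library(dplyr)
library(tidytext)

# Two invented "reports" with partly overlapping vocabularies
toy <- tibble::tribble(
  ~report, ~word,   ~n,
  "A",     "rates", 10,
  "A",     "talf",   5,
  "B",     "rates",  8,
  "B",     "war",    3
)

toy %>% bind_tf_idf(word, report, n)
# "rates" appears in both reports, so idf = ln(2/2) = 0 and its tf-idf is 0;
# "talf" and "war" each appear in only one report (idf = ln(2/1) ≈ 0.69),
# so they receive the highest tf-idf weights
```

This is exactly why very common words like "rates" drop out of the tf-idf rankings while report-specific acronyms rise to the top.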

We’ll also clean out some additional terms that pdftools picked up (like month abbreviations and hyphenation fragments) by augmenting our stop word list.

custom_stop_words = bind_rows(
  tibble(
    word = c(
      tolower(month.abb), "one","two","three","four","five","six",
      "seven","eight","nine","ten","eleven","twelve","mam","ered",
      "produc","ing","quar","ters","sug","fmam",
      "cient","thirty","pter",
      "pants","ter","ening","ances","www.federalreserve.gov",
      "tion","fig","ure","figure","src"
    ), 
    lexicon = c("custom")
  ), 
  stop_words
)

fed_text_b = fed_text %>%
  mutate(word = gsub("[^A-Za-z ]", "", word)) %>%  
  # keep only letters (drop numbers and special symbols)
  filter(word != "") %>%
  count(report, word, sort=TRUE) %>%
  bind_tf_idf(word, report, n) %>%
  arrange(desc(tf_idf))

# Remove the stop words
fed_text_b_filtered = fed_text_b %>%
  anti_join(custom_stop_words, by = "word") %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(report) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  filter(id < 11)

fed_text_b_filtered %>%
  datatable(options = list(autoWidth = TRUE))
# Highest tf-idf Words by Report
ggplot(fed_text_b_filtered, aes(word, tf_idf, fill = report)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~report, scales = "free", ncol = 5)+
  coord_flip()+
  theme_ridges(font_family = "Arial", font_size = 10)+
  theme(axis.text.x=element_blank())+
  labs(
    x="",y ="tf-idf",
    title="Highest tf-idf words in each Federal Reserve Monetary Policy Report: 1996-2022",
    subtitle="Top 10 terms by tf-idf statistic: term frequency and inverse document frequency",
    caption="Source: Federal Reserve Board Monetary Policy Reports\nNote: omits stop words, date abbreviations and numbers."
  )

This chart tells an interesting story: we can see the emergence of certain acronyms like JGTRRA (Jobs and Growth Tax Relief Reconciliation Act), TALF (Term Asset-Backed Securities Loan Facility), and LFPR (Labor Force Participation Rate). You can also see terms like terrorism (2002) and war (2003) associated with major geopolitical events.

The Monetary Policy Report also contains special topics, and you can see signs of them in some of the reports. For example, the 2016 report has the special topic “Have the Gains of the Economic Expansion Been Widely Shared?”, which discussed economic trends across demographic groups. You can see evidence of that in the prevalence of terms like “hispanic”, “race”, “black”, and “white” in that report.

2.3 Comparing Sentiment

How did sentiment vary across each report?

fed_sentiment = fed_text %>%
  anti_join(custom_stop_words2) %>%
  inner_join(get_sentiments("bing")) %>%
  count(report, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

fed_sentiment %>%
  datatable(options = list(autoWidth = TRUE))
# Sentiment Across the Years
ggplot(fed_sentiment, aes(index, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("red","#27408b"))+
  facet_wrap(~report, ncol = 5, scales = "free_x")+
  theme_ridges(font_family = "Arial")+
  labs(
    x = "index (approximately 3 pages per unit)",
    y = "sentiment",
    title = "Sentiment through Federal Reserve Monetary Policy Report",
    subtitle = "customized bing lexicon",
    caption = "Source: Federal Reserve Board Monetary Policy Reports"
  )

This result shows that sentiment tended to be negative in 2001–2003 and 2008–2009, around the last two economic recessions.

Let’s compute total sentiment by report.

fed_sentiment_2 = fed_text %>%
  anti_join(custom_stop_words2) %>%
  inner_join(get_sentiments("bing")) %>%
  count(report, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

fed_sentiment_2 %>%
  datatable(options = list(autoWidth = TRUE))
# Sentiment by Report
ggplot(
  fed_sentiment_2, 
  aes(factor(str_match(report, "[0-9]+")), sentiment/(negative + positive), fill = sentiment)
) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_c(option = "C") +
  theme_ridges(font_family = "Arial", font_size = 10) +
  labs(
    x = "report for July of each year",
    y = "Sentiment (>0 positive, <0 negative)",
    title = "Sentiment of Federal Reserve Monetary Policy Report: 1996-2022",
    subtitle = "customized bing lexicon",
    caption = "Source: Federal Reserve Board Monetary Policy Reports"
  )

2.4 Visualizing Word Correlations

We can follow Silge and Robinson and construct a graph to visualize word correlations and clusters of words. We’ll compute pairwise word correlations and then build a graph to represent them.
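The pairwise_cor function from the widyr package computes the phi coefficient between each pair of words, based on whether they appear together in the same section. A toy illustration with invented data (not the Fed text):

```r
library(dplyr)
library(widyr)

# Invented data: which words appear in which section
toy <- tibble::tibble(
  section = c(1, 1, 2, 2, 3, 3, 4),
  word    = c("rates", "inflation", "rates", "inflation", "rates", "growth", "growth")
)

# Phi coefficient for every word pair across sections
toy %>% pairwise_cor(word, section, sort = TRUE)
# "rates" and "inflation" co-occur whenever "inflation" appears,
# so that pair gets a positive correlation
```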

word_cors = fed_text_2 %>% 
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  filter(word %nin% stop_words$word) %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)

# word_cors %>% datatable(options = list(autoWidth = TRUE))
word_cors_filtered = word_cors %>%
  filter(correlation > .15)

graph_from_data_frame(word_cors_filtered) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color ="#27408b", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void(base_family = "Arial")+
  labs(
    title= "Pairs of words in Federal Reserve Monetary Policy Reports that show at\n  least a 0.15 correlation of appearing within the same 10-line section",
    caption= "Source: July Federal Reserve Board Monetary Policy Reports 1996-2022\n"
  )